Word Segmentation of Handwritten Dates in Historical Documents by Combining Semantic A-Priori-Knowledge with Local Features

Authors

  • Markus Feldbach
  • Klaus D. Tönnies
Abstract

Recognizing script in historical documents requires techniques that can identify single words. Segmenting lines and words is challenging because lines are not straight and words may intersect both within and between lines. For correct word segmentation, the conventional analysis of distances between text objects needs to be supplemented by a second component that predicts possible word boundaries from semantic information. For date entries, hypotheses about potential boundaries are generated from knowledge about the different ways dates are written in the documents. This knowledge is modeled by distribution curves for potential boundary locations. Word boundaries are detected by classifying local features, such as distances between adjacent text objects, together with the location-based boundary distribution curves as a-priori knowledge. We applied the technique to date entries in historical church registers. Documents from the 18th and 19th centuries were used for training and testing. The data set consisted of 674 word boundaries in 298 date entries. Our algorithm found the correct separation among the four best hypotheses for a word sequence in 97% of all cases in the test data set.
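The idea of combining a local gap feature with a location-based prior can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Gaussian-shaped prior curve, the gap-ratio score, the product used to combine them, and all names (`prior`, `local_score`, `rank_hypotheses`, `typical_gap`, `spread`) are assumptions made purely for illustration. Positions are assumed to be normalized to the range 0 to 1 along the text line.

```python
import math
from itertools import combinations


def prior(position, expected, spread=0.08):
    """Hypothetical a-priori curve: close to 1 near an expected boundary
    location, falling off with distance (Gaussian shape is an assumption)."""
    return max(math.exp(-0.5 * ((position - e) / spread) ** 2) for e in expected)


def local_score(gap, typical_gap):
    """Hypothetical local feature score: wider gaps score closer to 1."""
    return gap / (gap + typical_gap)


def rank_hypotheses(gaps, positions, expected, typical_gap, top_n=4):
    """Rank every way of choosing len(expected) boundaries among the
    candidate gaps and return the top_n index tuples, best first."""
    # Combined per-gap score: local evidence weighted by the prior.
    scores = [
        local_score(g, typical_gap) * prior(p, expected)
        for g, p in zip(gaps, positions)
    ]
    hypotheses = []
    for chosen in combinations(range(len(gaps)), len(expected)):
        hypotheses.append((math.prod(scores[i] for i in chosen), chosen))
    hypotheses.sort(reverse=True)
    return [chosen for _, chosen in hypotheses[:top_n]]


# Five candidate gaps (widths in pixels) and their normalized positions;
# the assumed date format expects boundaries near 0.3 and 0.65.
best = rank_hypotheses(
    gaps=[2, 9, 3, 8, 2],
    positions=[0.1, 0.3, 0.45, 0.65, 0.85],
    expected=[0.3, 0.65],
    typical_gap=5,
)
print(best[0])
```

Returning several ranked hypotheses rather than a single answer mirrors the evaluation described in the abstract, where the correct separation is sought among the four best hypotheses.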


Similar resources

Connected Component Based Word Spotting on Persian Handwritten image documents

Word spotting makes unindexed image documents searchable by locating a word or words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...


Radial Line Fourier Descriptor for Segmentation-free Handwritten Word Spotting

Automatic recognition of historical handwritten manuscripts is a daunting task due to paper degradation over time. Recognition-free retrieval or word spotting is popularly used for information retrieval and digitization of the historical handwritten documents. However, the performance of word spotting algorithms depends heavily on feature detection and representation methods. Although there exi...


A Search Engine for Handwritten Documents

The design and functionality of a versatile search engine on handwritten documents is described. Documents are indexed using global image features, e.g., stroke width, slant, word gaps, as well as local features that describe shapes of characters and words. Image indexing is done automatically using page analysis, page segmentation, line separation, word segmentation and recognition of characters ...


Local Binary Pattern for Word Spotting in Handwritten Historical Document

Digital libraries store images which can be highly degraded and to index this kind of images we resort to word spotting as our information retrieval system. Information retrieval for handwritten document images is more challenging due to the difficulties in complex layout analysis, large variations of writing styles, and degradation or low quality of historical manuscripts. This paper presents ...


Finding words in alphabet soup: Inference on freeform character recognition for historical scripts

This paper develops word recognition methods for historical handwritten cursive and printed documents. It employs a powerful segmentation-free letter detection method based upon joint boosting with histogram-of-gradients features. Efficient inference on an ensemble of hidden Markov models can select the most probable sequence of candidate character detections to recognize complete words in ambi...




Publication date: 2003